Bayesian Logistic Regression - Predicting the Probability of Disease (Diabetes)

Capstone Project 2025

Author

Namita Mishra, Autumn Wilcox, Ecil Teodoro (Advisor: Dr. Ashraf Cohen)

Published

October 21, 2025

Slides: slides.html (source: slides.qmd)


Group Project Workflow and Contributions

  • Autumn Wilcox – Contributed to analytic coding, content draft, structured the project workflow, and collaborated actively via GitHub.
  • Ecil Teodoro – Drafted and submitted sections of the Introduction.
  • Namita Mishra – Developed the project plan, content draft, analytic coding, and coordinated commits and collaboration with the group on GitHub.

Advisor: Dr. Ashraf Cohen – Provided expert guidance on statistical methods and workflow management across platforms.

Methods Summary

Data Source: NHANES adults dataset
Handling Missing Data: Multiple Imputation by Chained Equations (MICE)
Survey Weighting: Accounted for population representativeness

Modeling Approaches:

  • Survey-weighted logistic regression (MLE)
  • Multiple imputation logistic model
  • Bayesian logistic regression using NUTS sampling (4 chains, 2000 iterations)

Model Diagnostics: R̂ ≈ 1.00; high ESS → excellent convergence
Posterior Predictive Checks: Compared model predictions to observed prevalence

Introduction

Diabetes mellitus (DM) is a major public health concern closely associated with factors such as obesity, age, race, and gender. Identifying these associated risk factors is essential for targeted interventions D’Angelo and Ran (2025). Traditional logistic regression, which estimates the association between risk factors and outcomes, is insufficient for analyzing complex healthcare data (DNA sequences, imaging, patient-reported outcomes, electronic health records (EHRs), longitudinal health measurements, diagnoses, and treatments) Zeger et al. (2020). Classical maximum likelihood estimation (MLE) yields unstable results in samples that are small, have missing data, or present quasi- or complete separation.

Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) allow analysis of multivariate longitudinal healthcare data with repeated measures within individuals and individuals nested in a population. By integrating prior knowledge and including exogenous (e.g., age, clinical history) and endogenous (e.g., current treatment) covariates, Bayesian models provide posterior distributions and risk predictions for conditions such as pneumonia, prostate cancer, and mental disorders. Parametric assumptions remain a limitation of these models.

Chatzimichail and Hatjimihail (2023) show that parametric Bayesian models (with fixed parametric forms) often underperform compared to nonparametric models, which do not assume a prior distributional form. Posterior probabilities from Bayesian approaches improve disease classification and better capture heterogeneity in skewed, bimodal, or multimodal data distributions. Bayesian nonparametric models are flexible and robust, integrating multiple diagnostic tests and priors to enhance accuracy and precision, though reliance on prior information and restricted access to resources can limit applicability. Combining Bayesian methods with other statistical or computational approaches helps address systemic biases, incomplete data, and non-representative datasets.

The Bayesian framework described by R. van de Schoot et al. (2021) highlights the role of priors, data modeling, inference, posterior sampling, variational inference, and variable selection. Proper variable selection mitigates multicollinearity, overfitting, and limited sampling, improving predictive performance. Priors can be informative, weakly informative, or diffuse, and can be elicited from expert opinion, generic knowledge, or data-driven methods. Sensitivity analysis evaluates the alignment of priors with likelihoods, while MCMC simulations (e.g., brms, blavaan in R) empirically estimate posterior distributions. Spatial and temporal Bayesian models have applications in large-scale cancer genomics, identifying molecular interactions, mutational signatures, patient stratification, and cancer evolution, though temporal autocorrelation and subjective prior elicitation can be limiting.

Bayesian normal linear regression has been applied in metrology for instrument calibration using conjugate Normal–Inverse-Gamma priors Klauenberg et al. (2015). Hierarchical priors add flexibility by modeling uncertainty across multiple levels, improving robustness and interpretability. Bayesian hierarchical/meta-analytic linear regression incorporates both exchangeable and unexchangeable prior information, addressing multiple testing challenges, small sample sizes, and complex relationships among regression parameters across studies Leeuw and Klugkist (2012).

Sequential clinical reasoning models Liu et al. (2013) demonstrate screening by adding predictors stepwise: (1) demographics, (2) metabolic components, and (3) conventional risk factors, incorporating priors and mimicking clinical evaluation. This approach captures ecological heterogeneity and improves baseline risk estimation, though interactions between predictors and external cross-validation remain limitations.

Bayesian multiple imputation with logistic regression addresses missing data in clinical research Austin et al. (2021) by classifying missing values (e.g., patient refusal, loss to follow-up, mechanical errors) as MAR, MNAR, or MCAR. Multiple imputation generates plausible values across datasets and pools results for reliable classification of patient health status and mortality.

Aims

The study aims to apply Bayesian logistic regression to predict diabetes status and to evaluate the associations between body mass index (BMI), age (≥20 years), gender, and race as predictors, using a retrospective dataset from the 2013–2014 NHANES survey. NHANES employs a complex sampling design, including stratification, clustering, and oversampling of specific population subgroups, rather than uniform random sampling. A Bayesian analytical approach is used to address challenges posed by dataset anomalies, such as missing data and separation, and by reliance on complete-case analysis, all of which limit the efficiency and reliability of traditional logistic regression in predicting health outcomes.

Method and Data Preparation

In this study, Bayesian logistic regression is applied to predict diabetes status and estimate the association of key predictors, including body mass index (BMI), age, gender, and race, using the 2013–2014 NHANES dataset. The complex survey design of NHANES, which includes stratification, clustering, and oversampling, was accounted for to ensure accurate estimation. Posterior predictive distributions are generated to quantify the probability of diabetes for each individual, providing a flexible and robust framework for uncertainty quantification. Model predictions are also compared against known population prevalence, enabling validation and assessment of model realism. By integrating prior information with observed data, this approach identifies individuals at high risk for diabetes and informs targeted public health interventions, demonstrating the utility of Bayesian methods in analyzing complex, multivariate health data.

Statistical Tool: We use R and its packages and libraries to import data and perform data wrangling and analysis.

Data source: NHANES 2-year cross-sectional weighted data (2013–2014) Center for Health Statistics (1999). Three datasets (demographics, exam, questionnaire) were imported (the haven package converts the .XPT files into R data frames), and after selecting variables of interest a merged dataset was created.
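The import-and-merge step described above can be sketched as follows. This is a hedged sketch, not the exact code used: the cycle-specific table names (DEMO_H, BMX_H, DIQ_H for 2013–2014) and the choice of left joins are assumptions, and `nhanesA::nhanes()` downloads the tables from the NHANES website.

```r
# Sketch (assumptions: table names DEMO_H/BMX_H/DIQ_H; left joins on SEQN).
# Requires an internet connection, since nhanes() downloads the tables.
library(nhanesA)
library(dplyr)

demo <- nhanes("DEMO_H") %>%   # demographics, 2013-2014 cycle
  select(SEQN, RIDAGEYR, RIAGENDR, RIDRETH1, SDMVPSU, SDMVSTRA, WTMEC2YR)
exam <- nhanes("BMX_H") %>% select(SEQN, BMXBMI)          # body measures
ques <- nhanes("DIQ_H") %>% select(SEQN, DIQ010, DIQ050)  # diabetes questionnaire

merged_data <- demo %>%
  left_join(exam, by = "SEQN") %>%  # merge on participant ID
  left_join(ques, by = "SEQN")
```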

Code
# loading packages 
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("nhanesA")
library(nhanesA)

library(tidyverse)
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
library(Hmisc)
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(classpackage)
library(janitor)
install.packages("gt")   
library(gt)
library(survey)
library(DataExplorer)
library(logistf)

Data pre-processing and cleaning: Subsets were created from the original weighted datasets (demographics, exam, questionnaire) with selected variables, then merged on participant ID to create a single dataframe.

  1. Response Variable: Binary; type 2 / diagnosed diabetes (excluding gestational diabetes), from “Doctor told you have diabetes?” (DIQ010) combined with DIQ050, a secondary variable describing treatment status (insulin use), used to exclude gestational cases.
  2. Predictor Variable: Body Mass Index (factor, 4 levels, analyzed after standardization):
    o Underweight (<5th percentile)
    o Normal (5th–<85th)
    o Overweight (85th–<95th)
    o Obese (≥95th percentile)
    o Missing: kept as-is, since categorization provides clinically interpretable groups
  3. Covariates:
    1. Gender (factor, 2 levels): Male: Female
    2. Ethnicity (factor, 5 levels): Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other Race - Including Multi-Racial
    3. Age (number, continuous)

The merged dataset was cleaned, and exploratory data analysis results and visualizations are presented for 10,175 observations and 10 variables: race categorized in 5 levels, age range 0–80 years, gender (male and female), diabetes grouped from DIQ010 and DIQ050, and BMI as continuous.

Code
# weighted means of each variable                       
str(merged_data)
'data.frame':   10175 obs. of  10 variables:
 $ SEQN    : num  73557 73558 73559 73560 73561 ...
 $ RIDAGEYR: num  69 54 72 9 73 56 0 61 42 56 ...
 $ RIAGENDR: Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 1 2 1 2 ...
 $ RIDRETH1: Factor w/ 5 levels "Mexican American",..: 4 3 3 3 3 1 3 3 2 3 ...
 $ SDMVPSU : num  1 1 1 2 2 1 1 1 2 1 ...
 $ SDMVSTRA: num  112 108 109 109 116 111 105 114 106 112 ...
 $ WTMEC2YR: num  13481 24472 57193 55767 65542 ...
 $ BMXBMI  : num  26.7 28.6 28.9 17.1 19.7 41.7 NA 35.7 NA 26.5 ...
 $ DIQ010  : Factor w/ 5 levels "Yes","No","Borderline",..: 1 1 1 2 2 2 NA 2 2 2 ...
 $ DIQ050  : Factor w/ 4 levels "Yes","No","Refused",..: 1 1 1 2 2 2 NA 2 2 2 ...
Code
plot_str(merged_data)
head(merged_data)
   SEQN RIDAGEYR RIAGENDR           RIDRETH1 SDMVPSU SDMVSTRA WTMEC2YR BMXBMI
1 73557       69     Male Non-Hispanic Black       1      112 13481.04   26.7
2 73558       54     Male Non-Hispanic White       1      108 24471.77   28.6
3 73559       72     Male Non-Hispanic White       1      109 57193.29   28.9
4 73560        9     Male Non-Hispanic White       2      109 55766.51   17.1
5 73561       73   Female Non-Hispanic White       2      116 65541.87   19.7
6 73562       56     Male   Mexican American       1      111 25344.99   41.7
  DIQ010 DIQ050
1    Yes    Yes
2    Yes    Yes
3    Yes    Yes
4     No     No
5     No     No
6     No     No
Code
summary(merged_data)
      SEQN          RIDAGEYR       RIAGENDR   
 Min.   :73557   Min.   : 0.00   Male  :5003  
 1st Qu.:76100   1st Qu.:10.00   Female:5172  
 Median :78644   Median :26.00                
 Mean   :78644   Mean   :31.48                
 3rd Qu.:81188   3rd Qu.:52.00                
 Max.   :83731   Max.   :80.00                
                                              
                                RIDRETH1       SDMVPSU         SDMVSTRA    
 Mexican American                   :1730   Min.   :1.000   Min.   :104.0  
 Other Hispanic                     : 960   1st Qu.:1.000   1st Qu.:107.0  
 Non-Hispanic White                 :3674   Median :1.000   Median :111.0  
 Non-Hispanic Black                 :2267   Mean   :1.484   Mean   :110.9  
 Other Race - Including Multi-Racial:1544   3rd Qu.:2.000   3rd Qu.:115.0  
                                            Max.   :2.000   Max.   :118.0  
                                                                           
    WTMEC2YR          BMXBMI             DIQ010            DIQ050    
 Min.   :     0   Min.   :12.10   Yes       : 737   Yes       : 220  
 1st Qu.: 12562   1st Qu.:19.70   No        :8841   No        :9545  
 Median : 20175   Median :24.70   Borderline: 185   Refused   :   1  
 Mean   : 30585   Mean   :25.68   Refused   :   1   Don't know:   2  
 3rd Qu.: 36748   3rd Qu.:30.20   Don't know:   5   NA's      : 407  
 Max.   :171395   Max.   :82.90   NA's      : 406                    
                  NA's   :1120                                       
Code
library(dplyr)
library(knitr)



plot_intro(merged_data, title="Figure 1 (Merged dataset). Structure of variables and missing observations.")

Code
plot_missing(merged_data, title="Figure 2(Merged dataset). Breakdown of missing observations.")

Code
# print(glimpse(merged_data))
print(table(merged_data$BMDBMIC, useNA = "ifany"))
< table of extent 0 >
Code
print(table(merged_data$DIQ010,  useNA = "ifany"))

       Yes         No Borderline    Refused Don't know       <NA> 
       737       8841        185          1          5        406 
Code
# ---- Coercion helpers (handle labelled/character) ----
to_num <- function(x) {
  if (is.numeric(x)) return(x)
  xc <- as.character(x)
  n <- suppressWarnings(readr::parse_number(xc))
  if (mean(is.na(n)) > 0.80) {
    xlow <- tolower(trimws(xc))
    n <- dplyr::case_when(
      xlow %in% c("1","yes","yes, told") ~ 1,
      xlow %in% c("2","no","no, not told") ~ 2,
      xlow %in% c("3","borderline") ~ 3,
      xlow %in% c("7","refused") ~ 7,
      xlow %in% c("9","don't know","dont know","unknown") ~ 9,
      TRUE ~ NA_real_
    )
  }
  as.numeric(n)
}

merged_data <- merged_data %>%
  mutate(
    DIQ010   = to_num(DIQ010),
    DIQ050   = to_num(if (!"DIQ050" %in% names(.)) NA_real_ else DIQ050),
    BMXBMI   = suppressWarnings(as.numeric(BMXBMI)),
    RIDAGEYR = suppressWarnings(as.numeric(RIDAGEYR)),
    RIAGENDR = suppressWarnings(as.numeric(RIAGENDR)),
    RIDRETH1 = suppressWarnings(as.numeric(RIDRETH1)),
    SDMVPSU  = suppressWarnings(as.numeric(SDMVPSU)),
    SDMVSTRA = suppressWarnings(as.numeric(SDMVSTRA)),
    WTMEC2YR = suppressWarnings(as.numeric(WTMEC2YR))
  )

# ---- Diagnostics BEFORE save ----
cat("DIQ010 counts BEFORE save:\n")
DIQ010 counts BEFORE save:
Code
print(table(merged_data$DIQ010, useNA = "ifany"))

   1    2    3    7    9 <NA> 
 737 8841  185    1    5  406 
Code
cat("Count with DIQ010 in {1,2}:", sum(merged_data$DIQ010 %in% c(1,2), na.rm = TRUE), "\n")
Count with DIQ010 in {1,2}: 9578 
Code
# ---- Save to file for reuse ----
dir.create("data", showWarnings = FALSE, recursive = TRUE)
saveRDS(merged_data, "data/merged_2013_2014.rds")
message("Saved: data/merged_2013_2014.rds")

Exploratory data analysis

  • Used library(survey) to get weighted means and SDs of the variables; BMI and age were standardized.
  • Age was recoded into new variables; only participants ≥20 years were included in the analysis.
  • BMI is recoded and categorized as <18.5, 18.5–<25, 25–<30, 30–<35, 35–<40, ≥40.
  • Ethnicity is recoded as “Mexican American” = 1, “Other Hispanic” = 2, “NH White” = 3, “NH Black” = 4, “Other/Multi” = 5.
  • Since special codes are not random, they cannot simply be dropped; ignoring informative missingness (MAR or MNAR) could introduce bias. We transformed the special codes (3, 7, 9) to NA and included all NAs in the analysis. Visualization of missing data is presented below.
  • A final analytic dataset (‘adult’) was created with “NH White” and “Male” as the reference groups.

Code
## 
# ---------------- Basic Exploration (adults) ----------------

# Keep adults only and build analysis variables
adult <- merged_data %>%
  dplyr::filter(RIDAGEYR >= 20) %>%
  dplyr::transmute(
    # --- keep survey design variables so svydesign() can see them ---
    SDMVPSU, SDMVSTRA, WTMEC2YR,

    # --- outcome: DIQ010 (1 yes, 2 no; 3/7/9 -> NA) ---
    diabetes_dx = dplyr::case_when(
      DIQ010 == 1 ~ 1,
      DIQ010 == 2 ~ 0,
      DIQ010 %in% c(3, 7, 9) ~ NA_real_,
      TRUE ~ NA_real_
    ),

    # --- predictors (raw) ---
    bmi  = BMXBMI,
    age  = RIDAGEYR,

    # sex (1=Male, 2=Female)
    sex  = forcats::fct_recode(factor(RIAGENDR), Male = "1", Female = "2"),

    # race (5-level)
    race = forcats::fct_recode(
      factor(RIDRETH1),
      "Mexican American" = "1",
      "Other Hispanic"   = "2",
      "NH White"         = "3",
      "NH Black"         = "4",
      "Other/Multi"      = "5"
    ),

    # keep DIQ050 so we can safely reference it (may be absent/NA in some rows)
    
    DIQ050 = DIQ050
  ) %>%
  # standardize continuous predictors
  dplyr::mutate(
    age_c = as.numeric(scale(age)),
    bmi_c = as.numeric(scale(bmi)),
    bmi_cat = cut(
      bmi,
      breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
      labels = c("<18.5","18.5–<25","25–<30","30–<35","35–<40","≥40"),
      right = FALSE
    )
  ) %>%
  # adjust outcome: if female & DIQ050==1 ("only when pregnant"), set to 0 (not diabetes)
  dplyr::mutate(
    diabetes_dx = ifelse(sex == "Female" & !is.na(DIQ050) & DIQ050 == 1, 0, diabetes_dx)
  )

# Make NH White the reference level for race (clearer interpretation)
adult <- adult %>%
  dplyr::mutate(
    race = forcats::fct_relevel(race, "NH White")
  )

# --- sanity checks ---
cat("Adults n =", nrow(adult), "\n")
Adults n = 5769 
Code
glimpse(adult)
Rows: 5,769
Columns: 12
$ SDMVPSU     <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2…
$ SDMVSTRA    <dbl> 112, 108, 109, 116, 111, 114, 106, 112, 112, 113, 116, 114…
$ WTMEC2YR    <dbl> 13481.04, 24471.77, 57193.29, 65541.87, 25344.99, 61758.65…
$ diabetes_dx <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ bmi         <dbl> 26.7, 28.6, 28.9, 19.7, 41.7, 35.7, NA, 26.5, 22.0, 20.3, …
$ age         <dbl> 69, 54, 72, 73, 56, 61, 42, 56, 65, 26, 76, 33, 32, 38, 50…
$ sex         <fct> Male, Male, Male, Female, Male, Female, Male, Female, Male…
$ race        <fct> NH Black, NH White, NH White, NH White, Mexican American, …
$ DIQ050      <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ age_c       <dbl> 1.13241831, 0.27835981, 1.30323001, 1.36016725, 0.39223428…
$ bmi_c       <dbl> -0.33588609, -0.07028101, -0.02834336, -1.31443114, 1.7609…
$ bmi_cat     <fct> 25–<30, 25–<30, 25–<30, 18.5–<25, ≥40, 35–<40, NA, 25–<30,…

Adult dataset and variables

Observations: 5,769
Survey design: SDMVPSU, SDMVSTRA, WTMEC2YR
Outcome: diabetes_dx (numeric 0/1)
Covariates: bmi, age, sex, race, DIQ050
Centered covariates: age_c, bmi_c
BMI categories: bmi_cat

NHANES is a national survey based on a complex sampling design, oversampling certain groups (e.g., minorities, older adults) to ensure representation. It uses multistage sampling to represent the U.S. population, so we apply sampling weights, strata, and PSUs (primary sampling units) for valid estimates.

We use the survey design in regression analysis to avoid:

  • biased prevalence estimates (e.g., mean BMI or diabetes %)
  • underestimation of standard errors
  • incorrect inference for population-level parameters
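The survey design object used in the modeling section (`nhanes_design_adult`) is not constructed in the chunks shown; a minimal sketch with hypothetical toy data illustrates how it would typically be built with the survey package.

```r
library(survey)

# Toy stand-in for the adult NHANES extract (hypothetical values)
adult_demo <- data.frame(
  SDMVSTRA = c(104, 104, 105, 105),        # strata
  SDMVPSU  = c(1, 2, 1, 2),                # primary sampling units
  WTMEC2YR = c(10000, 20000, 15000, 5000), # 2-year MEC exam weights
  bmi      = c(24, 31, 28, 22)
)

# Declare clusters, strata, and weights; nest = TRUE means PSU codes
# are nested within strata (the usual NHANES coding)
nhanes_design_adult <- svydesign(
  ids     = ~SDMVPSU,
  strata  = ~SDMVSTRA,
  weights = ~WTMEC2YR,
  nest    = TRUE,
  data    = adult_demo
)

svymean(~bmi, nhanes_design_adult)  # design-based (weighted) mean BMI
```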

Code
# data exploration

print(table(adult$diabetes_dx, useNA = "ifany"))

   0    1 <NA> 
4974  618  177 
Code
print(table(adult$sex, useNA = "ifany"))

  Male Female 
  2758   3011 
Code
print(table(adult$race, useNA = "ifany"))

        NH White Mexican American   Other Hispanic         NH Black 
            2472              767              508             1177 
     Other/Multi 
             845 
Code
if (sum(!is.na(adult$diabetes_dx)) == 0) {
  stop("Too few non-missing outcomes for modeling (n = 0). Check DIQ010 upstream.")
}

# (optional plots omitted for brevity)

# save for downstream
if (!dir.exists("data")) dir.create("data", recursive = TRUE)
saveRDS(adult, "data/adult_cleaned_2013_2014.rds")
Code
ggplot(adult, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "white") +
  labs(
    title = "Distribution of Age >20 years",
    x = "Age (years)",
    y = "Count"
  ) +
  theme_minimal()

Code
ggplot(adult, aes(factor(diabetes_dx))) +
  geom_bar(fill = "steelblue") +
  labs(title="Diabetes Outcome Distribution in >20 years age group", x="diabetes_dx (0=No, 1=Yes)", y="Count")

Code
ggplot(adult, aes(factor(bmi_cat))) +
  geom_bar(fill = "steelblue") +
  labs(title="Diabetes Outcome Distribution by BMI in >20 years age group", x="bmi_cat")

Code
ggplot(adult, aes(x = factor(diabetes_dx), y = bmi)) +
  geom_boxplot(fill = "skyblue") +
  labs(
    title = "BMI Distribution by Diabetes Diagnosis in >20 years age group",
    x = "Diabetes Diagnosis (0 = No, 1 = Yes)",
    y = "BMI"
  ) +
  theme_minimal()

Code
# plots for adult data bmi categories and race categories

ggplot(adult, aes(x = factor(race), fill = factor(diabetes_dx))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes Diagnosis by Race in >20 years age group",
    x = "Race/Ethnicity",
    y = "Count",
    fill = "Diabetes Diagnosis\n(0 = No, 1 = Yes)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code
ggplot(adult, aes(x = factor(bmi_cat), fill = factor(diabetes_dx))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes Diagnosis by BMI in >20 years age group",
    x = "BMI",
    y = "Count",
    fill = "Diabetes Diagnosis\n(0 = No, 1 = Yes)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Statistical Methods and Modeling

Multiple Logistic Regression on the Survey-Weighted Dataset

- We conducted a frequentist multiple logistic regression on the survey-weighted dataset for complete-case analysis and data exploration.

Multivariate Imputation by Chained Equations (MICE)

  • We conducted MICE to manage missing data. Considering the small sample size, Multivariate Imputation by Chained Equations (MICE) was conducted as an alternative to the Bayesian approach Buuren and Groothuis-Oudshoorn (2011).
  • Multiple imputation (MI) is an alternative analytic approach for small datasets with missingness.
  • Flatness of the density, heavy tails, non-zero peakedness, skewness, and multimodality do not hamper the good performance of multiple imputation for the mean structure in samples of n > 400, even for high percentages (75%) of missing data in one variable.
  • Multiple imputation (MI) can be performed using the mice package in R.
  • Iterative MICE imputes missing values of one variable at a time, using regression models based on the other variables in the dataset.
  • In the chained process, each imputed variable becomes a predictor for the subsequent imputation, and the entire process is repeated multiple times to create several complete datasets, each reflecting different possibilities for the missing data.
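A minimal sketch of this chained-equations workflow, using the small `nhanes` example dataset bundled with the mice package (not our NHANES extract); the settings here are illustrative, not those used in the analysis.

```r
library(mice)

# Impute the bundled toy dataset (25 rows: age, bmi, hyp, chl with NAs)
imp_demo <- mice(nhanes, m = 5, maxit = 5, seed = 123, printFlag = FALSE)

# Fit the same model on each of the 5 completed datasets
fit_demo <- with(imp_demo, lm(chl ~ age + bmi))

# Pool estimates and standard errors across imputations (Rubin's rules)
summary(pool(fit_demo))
```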

Bayesian Logistic Regression

  • We conduct Bayesian logistic regression to estimate the association between BMI, age, sex, and race/ethnicity and to predict doctor-diagnosed diabetes (DIQ010).

Bayesian statistics is about updating beliefs with evidence: Posterior ∝ Likelihood × Prior.

  • Prior p(θ): your initial belief about a parameter before seeing the data.
  • Likelihood p(y|θ): how probable the observed data are given the parameters; derived from the model (e.g., the logistic regression likelihood).
  • Posterior p(θ|y): your updated belief about the parameter after seeing the data.
  • We selected Bayesian logistic regression since our study’s response variable is a binary outcome (diabetes: yes/no).
  • Bayesian logistic regression applies Bayes’ rule to a binomial probability model and predicts the probability of the disease outcome.
  • The model relates the predictors (age, race, BMI, gender) linearly to the response on the log-odds scale and assumes observations are independent.
  • Regression of a discrete variable (0 or 1) is a Bernoulli probability model that classifies the categorical response, here predicting diabetes.
  • The logit link converts the linear predictor into probabilities for the response variable.
  • We use weakly informative priors: Normal(0, 2.5) for the logistic regression coefficients and a wider student_t(3, 0, 10) prior for the intercept (as in the fitted model), which allows a wide range of baseline log-odds, helps convergence, and avoids extreme estimates; a good default for most applications in social, health, or epidemiological studies.
  • In Bayesian statistics, every unknown parameter (like a regression coefficient, mean, or variance) is treated as a random variable with a probability distribution that reflects uncertainty.
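The updating rule Posterior ∝ Likelihood × Prior can be made concrete with a conjugate Beta-Bernoulli example in base R (hypothetical numbers, not our model): a Beta(a, b) prior for a disease prevalence θ combined with Bernoulli outcomes yields a Beta posterior.

```r
# Beta(a, b) prior for prevalence theta; Bernoulli likelihood;
# conjugacy gives posterior Beta(a + #ones, b + #zeros).
a <- 2; b <- 8                        # prior belief: prevalence near 0.2
y <- c(1, 0, 0, 1, 0, 0, 0, 0, 1, 0) # hypothetical 0/1 outcomes

a_post <- a + sum(y)       # 2 + 3 = 5
b_post <- b + sum(1 - y)   # 8 + 7 = 15

prior_mean <- a / (a + b)                 # 0.2
post_mean  <- a_post / (a_post + b_post)  # 5 / 20 = 0.25
c(prior_mean = prior_mean, post_mean = post_mean)
```

Observing 3 cases in 10 draws shifts the belief about prevalence from 0.20 toward the data, landing at 0.25.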

Summaries of Bayesian regression include:

  • Posterior predictive probabilities
  • Posterior mean, median, credible intervals
  • Posterior probability (Outcome = 1)
  • Comparison with external (population) prevalence
  • Posterior model fit metrics
  • Prior versus posterior coefficient distributions
  • Posterior predictive checks
  • Uncertainty quantification

Model Equation

Bayesian Logistic Regression model

\[ \text{logit}(P(Y_i=1)) = \beta_0 + \beta_1 \cdot Age_i + \beta_2 \cdot BMI_i + \beta_3 \cdot Race_i + \beta_4 \cdot Gender_i \]

Linear Regression equation:

\[ y_i = \beta_0 + \beta_1 X_1 +\varepsilon_i \]
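On the logit scale the model is linear; the inverse logit turns the linear predictor into a probability. A base-R sketch with hypothetical coefficient values (not the fitted estimates):

```r
# P(Y = 1) = plogis(eta), where eta is the linear predictor on the
# log-odds scale. Coefficients below are assumed for illustration.
b0 <- -2.7; b_age <- 1.1; b_bmi <- 0.6

# e.g., a person 1 SD above mean age and 0.5 SD above mean BMI
eta <- b0 + b_age * 1.0 + b_bmi * 0.5
p   <- plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
p
```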

Diagnostics for Bayesian Logistic Regression

  • Trace plots for Markov Chain Monte Carlo convergence
  • Autocorrelation plots
  • Potential scale reduction factor (R̂) assessment
  • Posterior predictive checks
  • Residual analysis
  • Bayes factor comparison
  • Prior sensitivity analysis
  • Model fit assessment
  • Posterior predictive probability plots
  • Posterior interval coverage evaluation
  • Convergence diagnostics across chains
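Several of these checks are available directly for the `bayes_fit` model fitted below; a sketch (plot types are passed through to bayesplot and assumed available in the installed brms/bayesplot versions):

```r
library(brms)
library(bayesplot)

summary(bayes_fit)                      # Rhat and bulk/tail ESS per parameter
mcmc_plot(bayes_fit, type = "trace")    # trace plots across the 4 chains
mcmc_plot(bayes_fit, type = "acf_bar")  # autocorrelation of the draws
pp_check(bayes_fit, ndraws = 100)       # posterior predictive check vs observed
```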

Code
# Modeling

library(broom)
library(mice)
library(brms)
library(posterior)
library(bayesplot)
library(knitr)

# --- Guardrails for modeling ---
n_outcome <- sum(!is.na(adult$diabetes_dx))
if (n_outcome == 0) stop("Too few non-missing outcomes for modeling. n = 0")

# Ensure factors and >=2 observed levels among complete outcomes
adult <- adult %>%
  dplyr::mutate(
    sex  = if (!is.factor(sex))  factor(sex)  else sex,
    race = if (!is.factor(race)) factor(race) else race
  )

if (nlevels(droplevels(adult$sex[!is.na(adult$diabetes_dx)]))  < 2)
  stop("sex has <2 observed levels after filtering; check data availability.")
if (nlevels(droplevels(adult$race[!is.na(adult$diabetes_dx)])) < 2)
  stop("race has <2 observed levels after filtering; check Data Prep.")

# ------------------------- 1) Survey-weighted complete-case -------------------------
# Build a logical filter on the original adult data (same length as design$data)
keep_cc <- with(
  adult,
  !is.na(diabetes_dx) & !is.na(age_c) & !is.na(bmi_c) &
  !is.na(sex) & !is.na(race)
)

# Subset the survey design using the logical vector (same length as original)
des_cc <- subset(nhanes_design_adult, keep_cc)

# Corresponding complete-case data (optional)
cc <- adult[keep_cc, ] |> droplevels()
cat("\nComplete-case N for survey-weighted model:", nrow(cc), "\n")

Complete-case N for survey-weighted model: 5349 
Code
print(table(cc$race))

        NH White Mexican American   Other Hispanic         NH Black 
            2293              713              470             1101 
     Other/Multi 
             772 
Code
print(table(cc$diabetes_dx))

   0    1 
4752  597 
Code
print(table(cc$sex))

  Male Female 
  2551   2798 
Code
form_cc <- diabetes_dx ~ age_c + bmi_c + sex + race
svy_fit <- survey::svyglm(formula = form_cc, design = des_cc, family = quasibinomial())
summary(svy_fit)

Call:
svyglm(formula = form_cc, design = des_cc, family = quasibinomial())

Survey design:
subset(nhanes_design_adult, keep_cc)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -2.67143    0.11935 -22.383 1.68e-08 ***
age_c                 1.10833    0.05042  21.981 1.94e-08 ***
bmi_c                 0.63412    0.05713  11.099 3.88e-06 ***
sexFemale            -0.63844    0.10926  -5.843 0.000386 ***
raceMexican American  0.71091    0.13681   5.196 0.000826 ***
raceOther Hispanic    0.46469    0.13474   3.449 0.008712 ** 
raceNH Black          0.51221    0.15754   3.251 0.011677 *  
raceOther/Multi       0.84460    0.17756   4.757 0.001433 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.8455444)

Number of Fisher Scoring iterations: 6
Code
plot(residuals(svy_fit, type='deviance'))

Code
# Survey-weighted OR table (no intercept)
svy_or <- broom::tidy(svy_fit, conf.int = TRUE) %>%
  dplyr::mutate(OR = exp(estimate), LCL = exp(conf.low), UCL = exp(conf.high)) %>%
  dplyr::select(term, OR, LCL, UCL, p.value) %>%
  dplyr::filter(term != "(Intercept)")
knitr::kable(svy_or, caption = "Survey-weighted odds ratios (per 1 SD)")
Survey-weighted odds ratios (per 1 SD)
term OR LCL UCL p.value
age_c 3.0292807 2.6967690 3.4027912 0.0000000
bmi_c 1.8853571 1.6526296 2.1508579 0.0000039
sexFemale 0.5281132 0.4104905 0.6794397 0.0003857
raceMexican American 2.0358434 1.4850041 2.7910081 0.0008262
raceOther Hispanic 1.5915182 1.1664529 2.1714810 0.0087119
raceNH Black 1.6689718 1.1605895 2.4000450 0.0116773
raceOther/Multi 2.3270527 1.5451752 3.5045697 0.0014331

The deviance residual plot looks acceptable and does not show any systematic pattern.
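The `imp` object used in the next chunk is not created in the code shown; it would typically come from running mice on the adult analytic dataset, as in this hedged sketch (variable selection and settings are assumptions):

```r
library(mice)

# Impute missing values (chiefly bmi and diabetes_dx) among the
# analysis variables; m completed datasets are generated
imp <- mice(
  adult[, c("diabetes_dx", "age", "bmi", "sex", "race",
            "WTMEC2YR", "SDMVPSU", "SDMVSTRA")],
  m = 5, maxit = 10, seed = 123, printFlag = FALSE
)
```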

Code
fit_mi <- with(imp, {
  age_c <- as.numeric(scale(age))
  bmi_c <- as.numeric(scale(bmi))
  glm(diabetes_dx ~ age_c + bmi_c + sex + race, family = binomial())
})
pool_mi <- pool(fit_mi)
summary(pool_mi)
                  term   estimate  std.error  statistic       df       p.value
1          (Intercept) -2.6895645 0.09941301 -27.054453 5566.204 1.486581e-151
2                age_c  1.0660265 0.05594733  19.054108 5520.446  1.911564e-78
3                bmi_c  0.5468538 0.04473386  12.224604 5148.557  6.751227e-34
4            sexFemale -0.6178297 0.09379129  -6.587282 5551.660  4.892566e-11
5 raceMexican American  0.8877355 0.13750463   6.456041 5472.583  1.167455e-10
6   raceOther Hispanic  0.5606621 0.17485537   3.206433 5573.987  1.351505e-03
7         raceNH Black  0.6809629 0.11981185   5.683602 5576.734  1.385727e-08
8      raceOther/Multi  0.7476406 0.15300663   4.886328 4749.963  1.061140e-06
Code
## table 

mi_or <- summary(pool_mi, conf.int = TRUE, exponentiate = TRUE) %>%
  dplyr::rename(
    term = term, OR = estimate, LCL = `2.5 %`, UCL = `97.5 %`, p.value = p.value
  ) %>%
  dplyr::filter(term != "(Intercept)")
knitr::kable(mi_or, caption = "MI pooled odds ratios (per 1 SD)")
MI pooled odds ratios (per 1 SD)
term OR std.error statistic df p.value LCL UCL conf.low conf.high
2 age_c 2.9038183 0.0559473 19.054108 5520.446 0.0000000 2.6021752 3.2404277 2.6021752 3.2404277
3 bmi_c 1.7278084 0.0447339 12.224604 5148.557 0.0000000 1.5827382 1.8861754 1.5827382 1.8861754
4 sexFemale 0.5391132 0.0937913 -6.587282 5551.660 0.0000000 0.4485669 0.6479368 0.4485669 0.6479368
5 raceMexican American 2.4296216 0.1375046 6.456041 5472.583 0.0000000 1.8555327 3.1813298 1.8555327 3.1813298
6 raceOther Hispanic 1.7518320 0.1748554 3.206433 5573.987 0.0013515 1.2434346 2.4680953 1.2434346 2.4680953
7 raceNH Black 1.9757793 0.1198118 5.683602 5576.734 0.0000000 1.5621842 2.4988753 1.5621842 2.4988753
8 raceOther/Multi 2.1120110 0.1530066 4.886328 4749.963 0.0000011 1.5646727 2.8508138 1.5646727 2.8508138

Prior: We used a student_t(3, 0, 10) prior for the intercept R. V. D. Schoot et al. (2013), where student_t(ν, μ, σ) is

Student’s t-distribution with:

  • ν = 3: degrees of freedom (controls tail heaviness)
  • μ = 0: location (center, like the mean)
  • σ = 10: scale (spread, like a standard deviation)

student_t(3, 0, 10) is centered at 0, allows large values (several multiples of 10 in either direction), and has heavy tails (ν = 3), accommodating outliers or unexpectedly large parameter values. This helps the model remain stable, especially with small datasets, high correlation between predictors, or potential outliers in the data.
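The practical difference from a normal prior with the same scale can be checked in base R: the t(3) distribution keeps far more probability mass in the tails.

```r
# Tail mass beyond |x| > 30, i.e., 3 scale units from the center,
# for the two candidate intercept priors
p_t    <- 2 * pt(-30 / 10, df = 3)  # student_t(3, 0, 10)
p_norm <- 2 * pnorm(-30 / 10)       # normal(0, 10)
c(student_t = p_t, normal = p_norm) # t prior retains much more tail mass
```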

Code
library(gt)

# 3) Bayesian Logistic Regression (formula weights) 
adult_imp1 <- complete(imp, 1) %>%
  dplyr::mutate(
    age_c  = as.numeric(scale(age)),
    bmi_c  = as.numeric(scale(bmi)),
    wt_norm = WTMEC2YR / mean(WTMEC2YR, na.rm = TRUE),
    # ensure factor refs match survey/MICE:
    race = forcats::fct_relevel(race, "NH White"),
    sex  = forcats::fct_relevel(sex,  "Male")
  ) %>%
  dplyr::filter(!is.na(diabetes_dx), !is.na(age_c), !is.na(bmi_c),
                !is.na(sex), !is.na(race)) %>%
  droplevels()

stopifnot(all(is.finite(adult_imp1$wt_norm)))


glimpse(adult_imp1)
Rows: 5,592
Columns: 11
$ diabetes_dx <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ age         <dbl> 69, 54, 72, 73, 56, 61, 42, 56, 65, 26, 76, 33, 32, 38, 50…
$ bmi         <dbl> 26.7, 28.6, 28.9, 19.7, 41.7, 35.7, 23.6, 26.5, 22.0, 20.3…
$ sex         <fct> Male, Male, Male, Female, Male, Female, Male, Female, Male…
$ race        <fct> NH Black, NH White, NH White, NH White, Mexican American, …
$ WTMEC2YR    <dbl> 13481.04, 24471.77, 57193.29, 65541.87, 25344.99, 61758.65…
$ SDMVPSU     <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2…
$ SDMVSTRA    <dbl> 112, 108, 109, 116, 111, 114, 106, 112, 112, 113, 116, 114…
$ age_c       <dbl> 1.13241831, 0.27835981, 1.30323001, 1.36016725, 0.39223428…
$ bmi_c       <dbl> -0.33319172, -0.06755778, -0.02561558, -1.31184309, 1.7639…
$ wt_norm     <dbl> 0.3393916, 0.6160884, 1.4398681, 1.6500477, 0.6380722, 1.5…
Code
priors <- c(
  set_prior("normal(0, 2.5)", class = "b"),
  set_prior("student_t(3, 0, 10)", class = "Intercept") 
)

bayes_fit <- brm(
  formula = diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race,
  data    = adult_imp1,
  family  = bernoulli(link = "logit"),
  prior   = priors,
  chains  = 4, iter = 2000, seed = 123,
  control = list(adapt_delta = 0.95),
  refresh = 0   # quiet Stan output
)
Code
summary(bayes_fit)            # Bayesian model summary
 Family: bernoulli 
  Links: mu = logit 
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race 
   Data: adult_imp1 (Number of observations: 5592) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              -2.66      0.09    -2.84    -2.50 1.00     4187     3510
age_c                   1.09      0.06     0.97     1.22 1.00     3012     3098
bmi_c                   0.63      0.05     0.53     0.72 1.00     3472     3315
sexFemale              -0.66      0.10    -0.86    -0.46 1.00     4003     3052
raceMexicanAmerican     0.69      0.18     0.35     1.04 1.00     3526     2843
raceOtherHispanic       0.43      0.24    -0.07     0.89 1.00     4058     3114
raceNHBlack             0.54      0.15     0.24     0.82 1.00     3597     3177
raceOtherDMulti         0.82      0.19     0.45     1.19 1.00     3763     3257

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
prior_summary(bayes_fit)
               prior     class                coef group resp dpar nlpar lb ub
      normal(0, 2.5)         b                                                
      normal(0, 2.5)         b               age_c                            
      normal(0, 2.5)         b               bmi_c                            
      normal(0, 2.5)         b raceMexicanAmerican                            
      normal(0, 2.5)         b         raceNHBlack                            
      normal(0, 2.5)         b     raceOtherDMulti                            
      normal(0, 2.5)         b   raceOtherHispanic                            
      normal(0, 2.5)         b           sexFemale                            
 student_t(3, 0, 10) Intercept                                                
 tag       source
             user
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
             user
Code
###
library(ggplot2)

# priors for two coefficients (age and bmi) prior = N(0,2.5)
prior_draws <- tibble(
  term = rep(c("Age (per 1 SD)", "BMI (per 1 SD)"), each = 4000),
  value = c(rnorm(4000, 0, 2.5), rnorm(4000, 0, 2.5))
)

ggplot(prior_draws, aes(x = value, fill = term)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Prior Distributions for Coefficients",
       x = "Coefficient Value", y = "Density") +
  scale_fill_manual(values = c("skyblue", "orange"))

Code
# Diabetes vs BMI

library(ggplot2)

# Create the plot
ggplot(adult_imp1, aes(x = factor(diabetes_dx), y = bmi, fill = factor(diabetes_dx))) +
  geom_boxplot(alpha = 0.7) +
  scale_x_discrete(labels = c("0" = "No Diabetes", "1" = "Diabetes")) +
  labs(
    x = "Diabetes Diagnosis",
    y = "BMI",
    title = "BMI Distribution by Diabetes Status"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Code
# logistic regression curve
ggplot(adult_imp1, aes(x = bmi, y = diabetes_dx)) +
  geom_point(aes(y = diabetes_dx), alpha = 0.2, position = position_jitter(height = 0.02)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE, color = "blue") +
  labs(
    x = "BMI",
    y = "Probability of Diabetes",
    title = "Predicted Probability of Diabetes vs BMI"
  ) +
  theme_minimal()

Once we have posterior draws, we examine:

  • Summary statistics - mean, median, 95% credible intervals: summary(bayes_fit) or posterior_summary(bayes_fit)
  • Visualization of distribution shape: mcmc_hist(posterior, pars = c("b_age")) or mcmc_areas()
  • Pairwise plots of correlations between parameters: mcmc_pairs(posterior)
  • Posterior predictive checks - compare model predictions with observed data: pp_check(bayes_fit)
  • Model comparison using LOO or WAIC: loo(bayes_fit) or waic(bayes_fit)
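The checks listed above can be collected into a single diagnostic chunk (a sketch assuming the `bayes_fit` object from the modeling section and the bayesplot package; parameter names follow brms's `b_` convention):

```r
library(brms)
library(bayesplot)

post <- as_draws_df(bayes_fit)                    # posterior draws as a data frame

posterior_summary(bayes_fit)                      # means, medians, 95% CrIs
mcmc_hist(post, pars = "b_age_c")                 # distribution shape of one parameter
mcmc_areas(post, pars = c("b_age_c", "b_bmi_c"))  # densities with interval shading
mcmc_pairs(post, pars = c("b_age_c", "b_bmi_c"))  # pairwise parameter correlations
pp_check(bayes_fit)                               # simulated vs observed data
```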

Assumption and diagnostic checks for Bayesian logistic regression: posterior predictive plots for model fit and linearity, MCMC trace plots for convergence, and bayes_R2 for explained variance.

Code
summary(bayes_fit)
 Family: bernoulli 
  Links: mu = logit 
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race 
   Data: adult_imp1 (Number of observations: 5592) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              -2.66      0.09    -2.84    -2.50 1.00     4187     3510
age_c                   1.09      0.06     0.97     1.22 1.00     3012     3098
bmi_c                   0.63      0.05     0.53     0.72 1.00     3472     3315
sexFemale              -0.66      0.10    -0.86    -0.46 1.00     4003     3052
raceMexicanAmerican     0.69      0.18     0.35     1.04 1.00     3526     2843
raceOtherHispanic       0.43      0.24    -0.07     0.89 1.00     4058     3114
raceNHBlack             0.54      0.15     0.24     0.82 1.00     3597     3177
raceOtherDMulti         0.82      0.19     0.45     1.19 1.00     3763     3257

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
plot(bayes_fit)   # Posterior distributions

Code
pp_check(bayes_fit)      # Posterior predictive checks

Code
mcmc_trace(bayes_fit)    # Convergence (optional)

Code
bayes_R2(bayes_fit)      # Model fit
    Estimate  Est.Error      Q2.5    Q97.5
R2 0.1313342 0.01265055 0.1064607 0.156078
Code
    # Leave-one-out cross-validation

Estimated R²

Estimated R² = 0.13 (13.1%): the predictors are relevant, but most of the variability remains unexplained; other factors (genetics, lifestyle, environment, etc.) likely contribute. Below are the odds ratios computed from the posterior draws, alongside the Bayesian regression summary.
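The leave-one-out cross-validation flagged in the empty chunk above could be run as follows (a sketch assuming `bayes_fit`; PSIS-LOO may warn about high Pareto k values when observations carry weights):

```r
library(brms)

# PSIS-LOO estimate of out-of-sample predictive accuracy
loo_fit <- loo(bayes_fit)
print(loo_fit)       # elpd_loo, p_loo, and looic with standard errors

# WAIC as an alternative information criterion
waic_fit <- waic(bayes_fit)
print(waic_fit)
```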

Code
# Posterior ORs (drop intercept, clean labels)

bayes_or <- posterior_summary(bayes_fit, pars = "^b_") %>%
  as.data.frame() %>%
  tibble::rownames_to_column("raw") %>%
  dplyr::mutate(
    term = gsub("^b_", "", raw),
    term = gsub("race", "race:", term),
    term = gsub("sex",  "sex:",  term),
    term = gsub("OtherDMulti", "Other/Multi", term),
    term = gsub("OtherHispanic", "Other Hispanic", term),
    OR   = exp(Estimate),
    LCL  = exp(Q2.5),
    UCL  = exp(Q97.5)
  ) %>%
  dplyr::select(term, OR, LCL, UCL) %>%
  dplyr::filter(term != "Intercept")

knitr::kable(
  bayes_or %>%
    dplyr::mutate(dplyr::across(c(OR,LCL,UCL), ~round(.x, 2))),
  digits = 2,
  caption = "Bayesian posterior odds ratios (95% CrI) — reference: NH White (race), Male (sex)"
)
Bayesian posterior odds ratios (95% CrI) — reference: NH White (race), Male (sex)
| term | OR | LCL | UCL |
|------|----|-----|-----|
| age_c | 2.99 | 2.64 | 3.37 |
| bmi_c | 1.87 | 1.71 | 2.05 |
| sex:Female | 0.52 | 0.42 | 0.63 |
| race:MexicanAmerican | 2.00 | 1.41 | 2.84 |
| race:Other Hispanic | 1.54 | 0.93 | 2.43 |
| race:NHBlack | 1.71 | 1.28 | 2.27 |
| race:Other/Multi | 2.27 | 1.56 | 3.28 |
Code
# Combined table

if (!dir.exists("outputs")) dir.create("outputs", recursive = TRUE)
saveRDS(svy_fit,   "outputs/svy_fit.rds")
saveRDS(pool_mi,   "outputs/pool_mi.rds")
saveRDS(bayes_fit, "outputs/bayes_fit.rds")
saveRDS(svy_or,    "outputs/survey_OR_table.rds")
saveRDS(mi_or,     "outputs/mi_OR_table.rds")
saveRDS(bayes_or,  "outputs/bayes_OR_table.rds")
Code
# Results

# ---- Build compact results table (BMI & Age only) ----
library(dplyr); 
library(tidyr); 
library(knitr); 
library(stringr)

# pretty "OR (LCL–UCL)" string

fmt_or <- function(or, lcl, ucl, digits = 2) {
  paste0(
    formatC(or,  format = "f", digits = digits), " (",
    formatC(lcl, format = "f", digits = digits), "–",
    formatC(ucl, format = "f", digits = digits), ")"
  )
}

# guardrails: require these to exist from Modeling
stopifnot(exists("svy_or"), exists("mi_or"), exists("bayes_or"))
for (nm in c("svy_or","mi_or","bayes_or")) {
  if (!all(c("term","OR","LCL","UCL") %in% names(get(nm)))) {
    stop(nm, " must have columns: term, OR, LCL, UCL")
  }
}

svy_tbl   <- svy_or   %>% mutate(Model = "Survey-weighted MLE")
mi_tbl    <- mi_or    %>% mutate(Model = "MICE pooled")
bayes_tbl <- bayes_or %>% mutate(Model = "Bayesian")

all_tbl <- bind_rows(svy_tbl, mi_tbl, bayes_tbl) %>%
  mutate(term = case_when(
    str_detect(term, "bmi_c|\\bBMI\\b") ~ "BMI (per 1 SD)",
    str_detect(term, "age_c|\\bAge\\b") ~ "Age (per 1 SD)",
    TRUE ~ term
  )) %>%
  filter(term %in% c("BMI (per 1 SD)", "Age (per 1 SD)")) %>%
  mutate(OR_CI = fmt_or(OR, LCL, UCL, digits = 2)) %>%
  select(Model, term, OR_CI) %>%
  arrange(
    factor(Model, levels = c("Survey-weighted MLE","MICE pooled","Bayesian")),
    factor(term,  levels = c("BMI (per 1 SD)","Age (per 1 SD)"))
  )

res_wide <- all_tbl %>%
  pivot_wider(names_from = term, values_from = OR_CI) %>%
  rename(
    `BMI (per 1 SD) OR (95% CI)` = `BMI (per 1 SD)`,
    `Age (per 1 SD) OR (95% CI)` = `Age (per 1 SD)`
  )

kable(
  res_wide,
  align = c("l","c","c"),
  caption = "Odds ratios (per 1 SD) with 95% CIs across models"
)
Odds ratios (per 1 SD) with 95% CIs across models
| Model | BMI (per 1 SD) OR (95% CI) | Age (per 1 SD) OR (95% CI) |
|-------|----------------------------|-----------------------------|
| Survey-weighted MLE | 1.89 (1.65–2.15) | 3.03 (2.70–3.40) |
| MICE pooled | 1.73 (1.58–1.89) | 2.90 (2.60–3.24) |
| Bayesian | 1.87 (1.71–2.05) | 2.99 (2.64–3.37) |
Code
# Posterior predictive draws

#Posterior predictive checks (binary outcome)
pp_samples <- posterior_predict(bayes_fit, ndraws = 500)  # 500 draws

# Check dimensions
dim(pp_samples)  # rows = draws, cols = observations
[1]  500 5592
Code
# Density overlay of observed vs posterior-replicated outcomes
ppc_dens_overlay(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:50, ]) +
  labs(title = "Posterior Predictive Check: Density Overlay") +
  theme_minimal()

Code
# Alternative PP plot: bar chart of counts (preferred for a discrete outcome)

ppc_bars(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:50, ]) +
  labs(title = "Posterior Predictive Check: Barplot of Counts") +
  theme_minimal()

Code
#PP check for proportions (useful for binary) # mean comparison
## to check if the simulated means match the observed mean

## mean
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "mean") +
  labs(title = "Posterior Predictive Check: Mean of Replicates") +
  theme_minimal()

Code
## sd
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "sd") +
  labs(title = "PPC: Standard Deviation of Replicates") +
  theme_minimal()

Code
# PP checks with bayesplot options
color_scheme_set("blue")
ppc_scatter_avg(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ]) +
  labs(title = "Observed vs Predicted (Avg) Posterior Predictive")

Code
library(brms)
library(dplyr)

# Posterior summary

post_sum <- posterior_summary(bayes_fit)
colnames(post_sum)
[1] "Estimate"  "Est.Error" "Q2.5"      "Q97.5"    
Code
library(posterior)
library(bayesplot)

# Extract posterior draws as a draws_df # simulate posterior outcomes
post <- as_draws_df(bayes_fit)

# Check parameter names
colnames(post)
 [1] "b_Intercept"           "b_age_c"               "b_bmi_c"              
 [4] "b_sexFemale"           "b_raceMexicanAmerican" "b_raceOtherHispanic"  
 [7] "b_raceNHBlack"         "b_raceOtherDMulti"     "Intercept"            
[10] "lprior"                "lp__"                  ".chain"               
[13] ".iteration"            ".draw"                
Code
# Density overlay for age and bmi
mcmc_areas(post, pars = c( "b_age_c","b_bmi_c","b_sexFemale","b_raceMexicanAmerican", "b_raceOtherHispanic","b_raceNHBlack","b_raceOtherDMulti" ))

Code
###
# fitted() returns the predicted probability of diabetes for each observation
predicted <- fitted(bayes_fit, summary = TRUE)
observed <- adult_imp1[, c("bmi", "age")]

# Predicted probability of diabetes vs observed BMI

library(ggplot2)
ggplot(data = NULL, aes(x = observed$bmi, y = predicted[, "Estimate"])) +
  geom_point() +
  geom_errorbar(aes(ymin = predicted[, "Q2.5"], ymax = predicted[, "Q97.5"])) +
  xlab("Observed BMI") + ylab("Predicted probability of diabetes")

Code
# Combine observed and predicted into one data frame
plot_data <- adult_imp1 %>%
  mutate(
    predicted_bmi = predicted[, "Estimate"],
    lower_ci = predicted[, "Q2.5"],
    upper_ci = predicted[, "Q97.5"],
    obs_index = 1:nrow(adult_imp1)  # index for x-axis
  )

# Line plot
ggplot(plot_data, aes(x = obs_index)) +
  geom_line(aes(y = bmi, color = "Observed")) +               # observed BMI
  geom_line(aes(y = predicted_bmi, color = "Predicted")) +   # predicted BMI
  geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci), alpha = 0.2) +  # uncertainty
  labs(x = "Observation", y = "BMI", color = "Legend") +
  theme_minimal()

Code
###
summary(adult_imp1$bmi)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   14.1    24.1    27.7    29.0    32.4    82.9 
Code
summary(plot_data$bmi_c)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.09476 -0.69669 -0.19338 -0.01133  0.46371  7.52398 
Code
prior_summary(bayes_fit)
               prior     class                coef group resp dpar nlpar lb ub
      normal(0, 2.5)         b                                                
      normal(0, 2.5)         b               age_c                            
      normal(0, 2.5)         b               bmi_c                            
      normal(0, 2.5)         b raceMexicanAmerican                            
      normal(0, 2.5)         b         raceNHBlack                            
      normal(0, 2.5)         b     raceOtherDMulti                            
      normal(0, 2.5)         b   raceOtherHispanic                            
      normal(0, 2.5)         b           sexFemale                            
 student_t(3, 0, 10) Intercept                                                
 tag       source
             user
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
             user
Code
prior_draws <- tibble(
  term = rep(c("BMI (per 1 SD)", "Age (per 1 SD)"), each = 4000),
  estimate = c(rnorm(4000, 0, 1), rnorm(4000, 0, 1)),  # NOTE: the fitted coefficient prior is normal(0, 2.5); use sd = 2.5 to match it exactly
  type = "Prior"
)

summary(prior_draws)
     term              estimate             type          
 Length:8000        Min.   :-3.585220   Length:8000       
 Class :character   1st Qu.:-0.697054   Class :character  
 Mode  :character   Median : 0.000046   Mode  :character  
                    Mean   :-0.007350                     
                    3rd Qu.: 0.668561                     
                    Max.   : 3.922817                     
Code
post
# A draws_df: 1000 iterations, 4 chains, and 11 variables
   b_Intercept b_age_c b_bmi_c b_sexFemale b_raceMexicanAmerican
1         -2.6     1.1    0.70       -0.71                  0.67
2         -2.7     1.0    0.62       -0.57                  0.65
3         -2.6     1.1    0.65       -0.76                  0.63
4         -2.7     1.0    0.65       -0.67                  0.82
5         -2.6     1.1    0.61       -0.73                  0.75
6         -2.5     1.0    0.60       -0.77                  0.61
7         -2.8     1.1    0.66       -0.66                  0.52
8         -2.8     1.2    0.67       -0.57                  0.94
9         -2.8     1.1    0.65       -0.52                  0.84
10        -2.6     1.1    0.67       -0.85                  0.70
   b_raceOtherHispanic b_raceNHBlack b_raceOtherDMulti
1                0.605          0.52              0.95
2                0.338          0.45              0.69
3                0.566          0.63              0.54
4                0.453          0.61              0.78
5                0.090          0.50              0.62
6                0.015          0.48              0.60
7                0.736          0.50              0.84
8                0.913          0.57              1.07
9                0.570          0.66              0.81
10               0.467          0.54              0.97
# ... with 3990 more draws, and 3 more variables
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
Code
head(post)
# A draws_df: 6 iterations, 1 chains, and 11 variables
  b_Intercept b_age_c b_bmi_c b_sexFemale b_raceMexicanAmerican
1        -2.6     1.1    0.70       -0.71                  0.67
2        -2.7     1.0    0.62       -0.57                  0.65
3        -2.6     1.1    0.65       -0.76                  0.63
4        -2.7     1.0    0.65       -0.67                  0.82
5        -2.6     1.1    0.61       -0.73                  0.75
6        -2.5     1.0    0.60       -0.77                  0.61
  b_raceOtherHispanic b_raceNHBlack b_raceOtherDMulti
1               0.605          0.52              0.95
2               0.338          0.45              0.69
3               0.566          0.63              0.54
4               0.453          0.61              0.78
5               0.090          0.50              0.62
6               0.015          0.48              0.60
# ... with 3 more variables
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
Code
names(prior_draws)
[1] "term"     "estimate" "type"    
Code
library(brms)
library(tidyr)

# Extract posterior draws
post <- as_draws_df(bayes_fit) %>%      # bayes_fit = your brms model
  select(b_bmi_c, b_age_c) %>%               # select your coefficient columns
  pivot_longer(
    everything(),
    names_to = "term",
    values_to = "estimate"
  ) %>%
  mutate(
    term = case_when(
      term == "b_bmi_c" ~ "BMI (per 1 SD)",
      term == "b_age_c" ~ "Age (per 1 SD)"
    ),
    type = "Posterior"
  )


## visualization of prior and posterior draws
combined_draws <- bind_rows(prior_draws, post) 

library(ggplot2)

ggplot(combined_draws, aes(x = estimate, fill = type)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ term, scales = "free", ncol = 2) +
  theme_minimal(base_size = 13) +
  labs(
    title = "Prior vs Posterior Distributions",
    x = "Coefficient estimate",
    y = "Density",
    fill = ""
  )

Code
#### Compute proportion of diabetes=1 for each draw
pp_proportion <- rowMeans(pp_samples)  # proportion of 1's in each posterior draw


# Summary of posterior proportions
summary(pp_proportion)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09067 0.10533 0.10944 0.10935 0.11320 0.12715 
Code
# Optional: visualize the posterior probability distribution
pp_proportion_df <- tibble(proportion = pp_proportion)

ggplot(pp_proportion_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.01, fill = "skyblue", color = "black") +
  labs(
    title = "Posterior Distribution of Proportion of Diabetes = 1",
    x = "Proportion of Diabetes = 1",
    y = "Frequency"
  ) +
  theme_minimal()

Code
####

library(tidyverse)

# Posterior predicted proportion vector
# pp_proportion <- rowMeans(pp_samples)  # if not already done

known_prev <- 0.089   # NHANES prevalence

# Posterior summary
posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# Create a data frame for plotting
pp_df <- tibble(proportion = pp_proportion)

# Plot
ggplot(pp_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.005, fill = "skyblue", color = "black") +
  geom_vline(xintercept = known_prev, color = "red", linetype = "dashed", size = 1) +
  geom_vline(xintercept = posterior_mean, color = "blue", linetype = "solid", size = 1) +
  geom_rect(aes(xmin = posterior_ci[1], xmax = posterior_ci[2], ymin = 0, ymax = Inf),
            fill = "blue", alpha = 0.1, inherit.aes = FALSE) +
  labs(
    title = "Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    subtitle = paste0("Red dashed = NHANES prevalence (", known_prev, 
                      "), Blue solid = Posterior mean (", round(posterior_mean,3), ")"),
    x = "Proportion of Diabetes = 1",
    y = "Frequency"
  ) +
  theme_minimal()

Code
library(dplyr)

# Posterior predicted proportion
posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# NHANES prevalence with SE from survey::svymean
# Suppose you already have:
# svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
known_prev <- 0.089        # Mean prevalence
known_se   <- 0.0048       # Standard error from survey

# Calculate 95% confidence interval
known_ci <- c(
  known_prev - 1.96 * known_se,
  known_prev + 1.96 * known_se
)

# Print results
data.frame(
  Type = c("Posterior Prediction", "NHANES Prevalence"),
  Mean = c(posterior_mean, known_prev),
  Lower_95 = c(posterior_ci[1], known_ci[1]),
  Upper_95 = c(posterior_ci[2], known_ci[2])
)
                     Type      Mean   Lower_95  Upper_95
2.5% Posterior Prediction 0.1093519 0.09781831 0.1212446
        NHANES Prevalence 0.0890000 0.07959200 0.0984080
Code
####
library(ggplot2)
library(dplyr)

# Create a data frame for plotting
ci_df <- data.frame(
  Type = c("Posterior Prediction", "NHANES Prevalence"),
  Mean = c(0.1096674, 0.089),
  Lower_95 = c(0.09772443, 0.079592),
  Upper_95 = c(0.1210658, 0.098408)
)

# Plot
ggplot(ci_df, aes(x = Type, y = Mean, color = Type)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = Lower_95, ymax = Upper_95), width = 0.2) +
  ylim(0, max(ci_df$Upper_95) + 0.02) +
  labs(
    title = "Comparison of Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    y = "Proportion of Diabetes",
    x = ""
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

Code
# Save your dataset as CSV
write.csv(adult_imp1, "adult_imp1.csv", row.names = FALSE)

Results

Survey-Weighted Logistic Regression

Model:
`svyglm(formula = form_cc, design = des_cc, family = quasibinomial())`

All predictors are significant: age (p < 0.001) and BMI (p < 0.001) show strong positive associations with the outcome, while being female (p = 0.0004) is negatively associated. Other significant associations include raceMexican American (p = 0.0008), raceOther Hispanic (p = 0.0087), raceNH Black (p = 0.0117), and raceOther/Multi (p = 0.0014).

MICE

All predictors are statistically significant.

Positive associations: age (p < 0.001), BMI (p < 0.001), and all race categories compared to the reference.

Negative association: being female (p < 0.001)

Bayesian Logistic Regression

Sampling: NUTS (4 chains, 2000 iterations each; 1000 warmup, 4000 total post-warmup draws)

Convergence & Diagnostics

  • Rhat = 1.00 for all parameters → excellent convergence
  • Bulk_ESS / Tail_ESS: Large values (>2000) → high effective sample sizes, reliable posterior estimates.

Interpretation

  • Strong predictors: Age and BMI are strongly positively associated with diabetes risk.
  • Sex effect: females have a lower probability of diabetes than males.
  • R² = 0.13: about 13% of the variance in the outcome (diabetes_dx) is explained by the predictors (age, BMI, sex, race). The 95% credible interval (0.106–0.156) indicates that, given the model and data, the proportion of variance explained is plausibly between 10.6% and 15.6%, reflecting the uncertainty natural to probabilistic models.
  • Est.Error = 0.0127 is the posterior standard deviation of the R² estimate; its small value indicates the estimate is fairly precise.
  • Race/ethnicity: Mexican American, NH Black, and Other/Multi groups have higher odds of diabetes; the Other Hispanic effect is less certain.
  • age_c: each 1 SD increase in age raises the log-odds of diabetes by 1.09, a strong positive effect.
  • bmi_c: higher BMI is associated with higher diabetes risk.
  • sexFemale: females have lower odds of diabetes than males.
  • raceMexicanAmerican: higher odds of diabetes vs the NH White reference.
  • raceOtherHispanic: slightly higher odds vs the reference, but the credible interval crosses zero, so the effect is uncertain.
  • raceNHBlack: credibly higher odds of diabetes vs the reference.
  • raceOtherDMulti: higher odds of diabetes vs the reference group.

Posterior distribution of all parameters in the model.

Density plots smooth the posterior samples of each parameter (e.g., intercept, slopes) into a curve, showing where most of the posterior probability mass lies: the best estimates and their uncertainty.

Posterior Predictive Distribution - generated from the posterior

  • Predictive draws y_rep ∼ p(y_new | θ) simulate new data given the posterior parameter draws.
  • Posterior predictive checks (PPC) compare these simulations with the observed data to assess model fit.

Incorporating uncertainty involves two sources:

Parameter uncertainty, captured in the posterior distributions, and predictive uncertainty, captured in the posterior predictive draws.

Combining the two yields credible intervals for predictions rather than point predictions alone, supporting statements such as: given this BMI, the probability of diabetes is likely between 40% and 55%.
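A statement of that kind can be computed with brms's `posterior_epred()` for a covariate profile (a sketch; the profile values below are illustrative assumptions, not results from this report):

```r
library(brms)

# Hypothetical profile: male, NH White, average age, BMI 2 SD above the mean
new_obs <- data.frame(
  age_c = 0, bmi_c = 2, sex = "Male", race = "NH White", wt_norm = 1
)

# One predicted probability per posterior draw, so the spread of p_draws
# carries the full parameter uncertainty through to the prediction
p_draws <- posterior_epred(bayes_fit, newdata = new_obs)

mean(p_draws)                        # posterior mean probability of diabetes
quantile(p_draws, c(0.025, 0.975))   # 95% credible interval for that probability
```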

Comparing Models

  • All three models (survey-weighted MLE, multiple imputation, Bayesian) agree closely on the direction and magnitude of the effects of BMI and age.
  • Age is a stronger predictor than BMI, nearly tripling the odds per 1 SD.
  • BMI significantly increases diabetes risk (~1.7–1.9× per 1 SD).
  • Differences between models are minor, indicating robust and reliable findings despite missing data or modeling approach.

Diabetes Predicted proportion vs population prevalence

  • Posterior predicted proportion is slightly higher than the observed NHANES prevalence (10.97% vs 8.9%).
  • Intervals overlap slightly, but the posterior tends to overestimate diabetes compared to NHANES.

The difference could be due to:

  • Model assumptions (logistic link, priors)
  • Predictor effects (BMI, age, sex, race)
  • Sample characteristics vs population weighting in NHANES

Conclusion

The results suggest the model is reasonable but tends to slightly overestimate risk relative to the observed population prevalence.

Across multiple modeling approaches—survey-weighted maximum likelihood, multiple imputation, and Bayesian regression—both age and BMI were consistently strong predictors of diabetes. Each standard deviation increase in age nearly tripled the odds of diabetes, while a similar increase in BMI elevated the odds by approximately 1.7–1.9 times. The consistency of these results across models highlights the robustness of the associations and underscores the importance of age and BMI as key risk factors for diabetes in this population.

Effect stability: point estimates from the Bayesian model closely aligned with those from the frequentist models, indicating that prior regularization stabilized the estimates in the presence of modest missingness.

Uncertainty quantification: Bayesian credible intervals for the odds ratios were slightly narrower yet overlapped the frequentist confidence intervals, suggesting comparable inferential precision with improved interpretability.

Design considerations:

  • Survey-weighted MLE (maximum likelihood estimation): each observation enters the likelihood weighted by its survey weight, so the estimates reflect population-level parameters rather than just the sample, producing population-representative estimates.
  • Bayesian model with normalized weights: rather than fully modeling the survey design, normalized sampling weights were used as importance weights. Scaling the weights to sum to the sample size approximates the effect of survey weights but does not fully account for stratification, clustering, or design-based variance adjustments; the Bayesian inference treats the weighted likelihood as if it came from an ordinary model, ignoring some survey design features.

Autocorrelation Interpretation - Autocorrelation measures how correlated each MCMC sample is with earlier samples; the lag is the number of steps between samples, and lag 0 is always 1 (perfect correlation with itself). For both b_age_c (age coefficient) and b_bmi_c (BMI coefficient), each of the four chains shows autocorrelation dropping to near zero after lag 1–2, indicating good mixing: successive samples are mostly independent after a short lag. The absence of persistent high autocorrelation indicates the MCMC chains converged well. Low autocorrelation, as seen here, is desirable because the posterior estimates are then reliable and not biased by chain dependence.
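The autocorrelation plots interpreted above can be reproduced with bayesplot (a sketch assuming `bayes_fit` from the modeling section):

```r
library(brms)
library(bayesplot)

post <- as_draws_df(bayes_fit)

# Per-chain autocorrelation by lag for the age and BMI coefficients;
# well-mixed chains drop to near zero within the first few lags
mcmc_acf(post, pars = c("b_age_c", "b_bmi_c"), lags = 10)
```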

Discussions

The use of multiple imputation allowed for robust analysis despite missing data, increasing power and reducing bias. Comparison of frequentist and Bayesian models demonstrated consistency in the significant predictors, while the Bayesian approach added posterior distributions and probabilistic interpretation. Across all models, both age and BMI emerged as strong and consistent predictors of diabetes, and this consistency across modeling approaches strengthens the validity of the findings. Multiple imputation accounted for potential biases due to missing data, and Bayesian modeling provided credible intervals that closely matched the frequentist estimates. These results align with previous epidemiological research indicating that increasing age and higher BMI are among the most important determinants of type 2 diabetes risk, reflecting cumulative exposure to metabolic and lifestyle risk factors over time and the role of excess adiposity and insulin-related effects. Using the survey-weighted dataset ensured population representativeness, multiple imputation handled missing data, and rigorous Bayesian estimation delivered high effective sample sizes, with R̂ ≈ 1.00 across parameters confirming excellent convergence. Bayesian logistic regression thus provided statistically consistent and interpretable inference, achieving the aim of this study. Future research could fit hierarchical models across NHANES cycles or add variables (e.g., laboratory tests) to assess nonlinear effects of metabolic risk factors.

Limitations

  1. The study is based on cross-sectional, observational NHANES data, which limits the ability to make causal inferences. Associations observed between BMI, age, and diabetes status cannot confirm causation.
  2. The project relies on multiple imputation for missing values. Although imputation reduces bias, it assumes data are missing at random (MAR); if data are missing not at random (MNAR), results may be biased.
  3. Potential Residual Confounding - Models included key predictors (age, BMI, sex, race), but unmeasured factors like diet, physical activity, socioeconomic status, or genetic predisposition could confound associations.
  4. Bayesian models depend on prior choices, which could influence posterior estimates if priors are informative or mis-specified.
  5. Variable Measurement - BMI is measured at a single time point, which may not reflect long-term exposure or risk.
  6. Self-reported variables - are subject to recall or reporting bias.
  7. Interactions (e.g., BMI × age) are not tested in the study, so potential synergistic effects might be missed.
  8. Predicted probabilities are model-based estimates, not validated for clinical decision-making. External validation in independent cohorts is needed before application.

Question for Targeted Therapy

Translational Perspective from the Bayesian Diabetes Prediction Project. This project further demonstrates the translational potential of Bayesian modeling in clinical decision-making and public health strategy. By using patient-level predictors such as age, BMI, sex, and race to estimate the probability of diabetes, the model moves beyond descriptive statistics toward individualized risk prediction. The translational move lies in converting these probabilistic outputs into actionable thresholds—such as identifying the BMI or age at which the predicted risk of diabetes exceeds a clinically meaningful level (e.g., 30%). Such insights can guide early screening, personalized lifestyle interventions, and targeted prevention programs for populations at higher risk. This approach embodies precision public health—bridging data science and medical decision-making to deliver tailored, evidence-based strategies that can ultimately improve diabetes prevention and management outcomes.

What changes in modifiable predictors would lower diabetes risk?

Translational Research Implications:

  • We now use the model to guide prevention or intervention.
  • Only BMI is a modifiable risk factor here.
  • What must change in BMI, behavior, or lifestyle to achieve a lower risk threshold? In practice, we hold non-modifiable predictors (sex, race) constant and vary the modifiable predictor (BMI) until the model predicts the desired probability.
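When only BMI varies and the other terms are held fixed, the BMI that yields a target risk can also be found in closed form by inverting the logit, complementing the grid search used in the code that follows. A sketch with hypothetical coefficients (the values are illustrative, not the fitted model's estimates):

```python
import math

def bmi_for_target_risk(target_p, beta0, beta_bmi, other_contrib):
    """Invert logit(p) = beta0 + beta_bmi * bmi + other_contrib for bmi."""
    logit = math.log(target_p / (1.0 - target_p))
    return (logit - beta0 - other_contrib) / beta_bmi

# Hypothetical coefficients, NOT the fitted model's estimates
b0, b_bmi = -6.0, 0.12   # assumed intercept and BMI slope (log-odds scale)
other = 0.04 * 40        # assumed age term for a 40-year-old
print(round(bmi_for_target_risk(0.30, b0, b_bmi, other), 1))  # ≈ 29.6
```

With a real fit, posterior draws of the coefficients would give a distribution of target BMIs rather than a single number, which is what the grid-search approach below approximates with the posterior mean.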
Code
# Use the first participant in the dataset as an example case
participant1_data <- adult[1, ]


# predicted probabilities for patient 1
phat1 <- posterior_linpred(bayes_fit, newdata = participant1_data, transform = TRUE)
# 'transform = TRUE' gives probabilities for logistic regression

# Store in a data frame for plotting
post_pred_df <- data.frame(pred = phat1)

# Compute 95% credible interval
ci_95_participant1 <- quantile(phat1, c(0.025, 0.975))

# Plot

ggplot(post_pred_df, aes(x = pred)) + 
  geom_density(color='darkblue', fill='lightblue') +
  geom_vline(xintercept = ci_95_participant1[1], color='red', linetype='dashed') +
  geom_vline(xintercept = ci_95_participant1[2], color='red', linetype='dashed') +
  xlab('Probability of being diabetic (Outcome=1)') +
  ggtitle('Posterior Predictive Distribution 95% Credible Interval') +
  theme_bw()

Code
participant2_data  <- adult[2, ]


# predicted probabilities for participant 2
phat2 <- posterior_linpred(bayes_fit, newdata = participant2_data, transform = TRUE)
# 'transform = TRUE' gives probabilities for logistic regression

# Store in a data frame for plotting
post_pred_df2 <- data.frame(pred = phat2)

# Compute 95% credible interval
ci_95_participant2 <- quantile(phat2, c(0.025, 0.975))

# Plot

ggplot(post_pred_df2, aes(x = pred)) + 
  geom_density(color='darkblue', fill='lightblue') +
  geom_vline(xintercept = ci_95_participant2[1], color='red', linetype='dashed') +
  geom_vline(xintercept = ci_95_participant2[2], color='red', linetype='dashed') +
  xlab('Probability of being diabetic (Outcome=1)') +
  ggtitle('Posterior Predictive Distribution 95% Credible Interval') +
  theme_bw()

Code
library(ggplot2)


# NOTE: the model was fit on centered covariates (age_c, bmi_c), so raw
# values such as 40 and 25 should be centered (value minus the sample
# mean) before prediction; passing raw values likely explains the
# degenerate predicted probabilities of 1 shown below
new_participant <- data.frame(
  age_c = 40,
  bmi_c = 25,
  sex   = "Female",
  race  = "Mexican American"
)

# Posterior predicted probabilities
phat_new <- posterior_linpred(bayes_fit, newdata = new_participant, transform = TRUE)

# Convert to numeric vector
phat_vec <- as.numeric(phat_new)

# Check the range to see if all values are similar
range(phat_vec)
[1] 1 1
Code
# Store in a data frame
post_pred_df_new <- data.frame(pred = phat_vec)

# Compute 95% credible interval from the vector
ci_95_new_participant <- quantile(phat_vec, c(0.025, 0.975))

# Plot
ggplot(post_pred_df_new, aes(x = pred)) + 
  geom_density(color='darkblue', fill='lightblue', alpha = 0.6) +
  geom_vline(xintercept = ci_95_new_participant[1], color='red', linetype='dashed') +
  geom_vline(xintercept = ci_95_new_participant[2], color='red', linetype='dashed') +
  xlim(0, 1) +  # ensures you see the curve even if values are close
  xlab('Probability of being diabetic (Outcome=1)') +
  ggtitle('Posterior Predictive Distribution (95% Credible Interval)') +
  theme_bw()

Code
# Grid of possible BMI values; if the model was fit on centered bmi_c,
# these raw values should be centered (BMI minus the sample mean) first
bmi_seq <- seq(18, 40, by = 0.5)

newdata_grid <- data.frame(
  age_c = 40,
  bmi_c = bmi_seq,
  sex   = "Female",
  race  = "Mexican American"
)


# Posterior mean predicted probabilities
pred_probs <- posterior_linpred(bayes_fit, newdata = newdata_grid, transform = TRUE)
# Average over posterior draws to get the mean predicted probability per BMI
prob_mean <- colMeans(pred_probs)

# Combine with BMI values
pred_df <- cbind(newdata_grid, prob_mean)

target_prob <- 0.3
closest <- pred_df[which.min(abs(pred_df$prob_mean - target_prob)), ]

closest
  age_c bmi_c    sex             race prob_mean
1    40    18 Female Mexican American         1
Code
ggplot(pred_df, aes(x = bmi_c, y = prob_mean)) +
  geom_line(color = "darkblue", linewidth = 1.2) +
  geom_hline(yintercept = target_prob, color = "red", linetype = "dashed") +
  geom_vline(xintercept = closest$bmi_c, color = "red", linetype = "dotted") +
  annotate("text", x = closest$bmi_c, y = target_prob + 0.05,
           label = paste0("Target BMI ≈ ", round(closest$bmi_c, 1)),
           color = "red", hjust = -0.1) +
  labs(
    x = "BMI (centered or raw, depending on model)",
    y = "Predicted Probability of Diabetes",
    title = "Inverse Prediction: BMI Needed for Target Diabetes Risk"
  ) +
  theme_bw()

Implications

  • Age and BMI are robust, independent predictors of diabetes, underscoring the importance of early targeted interventions in mitigating diabetes risk.
  • Longitudinal studies, and combining Bayesian inference with other statistical methods, can further strengthen precision prevention strategies.

References

Austin, Peter C., Ian R. White, Douglas S. Lee, and Stef van Buuren. 2021. "Missing Data in Clinical Research: A Tutorial on Multiple Imputation." Canadian Journal of Cardiology 37 (9): 1322–31. https://doi.org/10.1016/j.cjca.2020.11.010.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. "mice: Multivariate Imputation by Chained Equations in R." Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
National Center for Health Statistics. 2013. "National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010." Vital and Health Statistics, Series 2, Number 161. https://wwwn.cdc.gov/nchs/data/series/sr02_161.pdf.
Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. "A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis." Diagnostics 13 (19). https://doi.org/10.3390/DIAGNOSTICS13193135.
D’Angelo, Gina, and Di Ran. 2025. "Tutorial on Firth’s Logistic Regression Models for Biomarkers in Preclinical Space." Pharmaceutical Statistics 24 (1): 1–8. https://doi.org/10.1002/pst.2422.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. "A Tutorial on Bayesian Normal Linear Regression." Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Leeuw, Christiaan de, and Irene Klugkist. 2012. "Augmenting Data With Published Results in Bayesian Linear Regression." Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.
Liu, Yi Ming, Sam Li Sheng Chen, Amy Ming Fang Yen, and Hsiu Hsi Chen. 2013. "Individual Risk Prediction Model for Incident Cardiovascular Disease: A Bayesian Clinical Reasoning Approach." International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/J.IJCARD.2012.05.016.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. "Bayesian Statistics and Modelling." Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
Schoot, Rens van de, David Kaplan, Jaap Denissen, Jens B. Asendorpf, and Marcel A. G. van Aken. 2013. "A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research." https://doi.org/10.1111/cdev.12169.
Zeger, Scott L., Zhenke Wu, Yates Coley, Anthony Todd Fojo, Bal Carter, Katherine O’Brien, Peter Zandi, et al. 2020. "Using a Bayesian Approach to Predict Patients’ Health and Response to Treatment." http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=reference&D=medp&NEWS=N&AN=37708307.